Add an encoding parameter to io.load_tabby #116

Open. mslw wants to merge 2 commits into main.

mslw
Contributor

@mslw mslw commented Nov 21, 2023

This PR resolves #112 by adding an optional encoding parameter to io.load_tabby. The parameter specifies the encoding used when reading TSV files.

When not specified (encoding=None), we keep the default behavior (implicitly using locale.getencoding(); see footnotes 1 and 2).
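To illustrate the underlying behavior, here is a minimal sketch using only the standard library. It does not call load_tabby; `read_text_with_encoding` and `demo` are hypothetical helpers that only show how the `encoding` argument of `Path.open()` changes the outcome for an iso-8859-1 file.

```python
import tempfile
from pathlib import Path


def read_text_with_encoding(path, encoding=None):
    """Read a text file, passing `encoding` through to Path.open().

    Hypothetical helper for illustration. With encoding=None, Python
    picks the encoding itself, which depends on the locale.
    """
    with Path(path).open(encoding=encoding) as f:
        return f.read()


def demo():
    """Write an iso-8859-1 TSV file and read it back two ways."""
    with tempfile.TemporaryDirectory() as tmp:
        tsv = Path(tmp) / "authors.tsv"
        # Simulate a file saved with iso-8859-1 encoding.
        tsv.write_bytes("name\tcity\nJürgen\tKöln\n".encode("iso-8859-1"))

        # Reading as utf-8 fails, because 0xFC (ü in iso-8859-1)
        # is not valid utf-8.
        try:
            read_text_with_encoding(tsv, encoding="utf-8")
            utf8_failed = False
        except UnicodeDecodeError:
            utf8_failed = True

        # Reading with the matching encoding succeeds.
        text = read_text_with_encoding(tsv, encoding="iso-8859-1")
    return utf8_failed, text
```

Calling `demo()` should report that the utf-8 read failed and return the original text for the iso-8859-1 read.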

With external libraries it might be possible to guess, from a file's content, an encoding that produces a correct result, but success is not guaranteed when the entire file contains only a few non-ASCII characters (think: list of authors). I made an attempt with #114 but didn't like it in the end. Here, we do not attempt to guess, instead expecting the user to know the encoding they need to use.
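One reason guessing is fragile can be sketched with plain Python, without any detection library. The helper `decode_candidates` below is hypothetical. It shows that the same bytes can decode without error under more than one encoding, so a successful decode alone does not tell us which encoding is right.

```python
def decode_candidates(raw, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding and collect the decoded text (None on failure).

    Hypothetical helper for illustration, not part of datalad-tabby.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# A name stored as utf-8 bytes decodes under both encodings,
# but only one of the results is the intended text.
candidates = decode_candidates("Müller".encode("utf-8"))

# Pure ASCII content decodes identically under both encodings,
# so it carries no information about the encoding at all.
ascii_candidates = decode_candidates(b"Smith")
```

Here iso-8859-1 accepts every byte sequence, so it never raises; it just yields different characters than intended.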

This PR also fixes an unrelated documentation typo to satisfy the codespell checks.

Footnotes

  1. https://docs.python.org/3/library/pathlib.html#pathlib.Path.open

  2. https://docs.python.org/3/library/functions.html#open

mslw added 2 commits November 21, 2023 16:52
By default, `Path.open()` uses `locale.getencoding()` when opening the
file for reading. This has caused problems on Linux (where utf-8 is
the default) when loading files saved (presumably on Windows) with
iso-8859-1 encoding, see psychoinformatics-de#112

The default behaviour is maintained with `encoding=None`, and any
valid encoding name can be provided as an argument to load_tabby. The
encoding will be used for loading tsv files.

The encoding is stored as an attribute of `_TabbyLoader` rather than
passed as an input to the load functions - since they may end up being
called in a few places (when sheet import statements are found), it
would be too much passing around otherwise.
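
The design choice described above can be sketched roughly as follows. This is not the actual `_TabbyLoader`; `SketchLoader`, its methods, and the `@import` row marker are hypothetical and only illustrate storing the encoding once on the instance so that nested loads pick it up without extra arguments.

```python
import csv
import tempfile
from pathlib import Path


class SketchLoader:
    """Hypothetical stand-in, NOT the real _TabbyLoader."""

    def __init__(self, encoding=None):
        # Stored once; every load below reuses it.
        self._encoding = encoding

    def load_rows(self, path):
        """Read one TSV file using the stored encoding."""
        with Path(path).open(newline="", encoding=self._encoding) as f:
            return list(csv.reader(f, delimiter="\t"))

    def load_with_imports(self, path):
        """Load a sheet, following made-up '@import <file>' rows."""
        rows = []
        for row in self.load_rows(path):
            if row and row[0] == "@import":
                # The nested call needs no encoding argument.
                rows.extend(self.load_with_imports(Path(path).parent / row[1]))
            else:
                rows.append(row)
        return rows


def demo():
    """Load two iso-8859-1 files, one importing the other."""
    with tempfile.TemporaryDirectory() as tmp:
        main = Path(tmp) / "main.tsv"
        authors = Path(tmp) / "authors.tsv"
        main.write_bytes("title\tStudie\n@import\tauthors.tsv\n".encode("iso-8859-1"))
        authors.write_bytes("name\tJürgen\n".encode("iso-8859-1"))
        return SketchLoader(encoding="iso-8859-1").load_with_imports(main)
```
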

With external libraries it might be possible to guess, from a file's
content, an encoding that produces a correct result, but success is
not guaranteed when the entire file contains only a few non-ASCII
characters (think: list of authors). Here, we do not attempt to
guess, instead expecting the user to know the encoding they need to
use.

Ref:
https://docs.python.org/3/library/pathlib.html#pathlib.Path.open
https://docs.python.org/3/library/functions.html#open
mslw added a commit to sfb1451/tabby-utils that referenced this pull request Nov 21, 2023
This relies on a not-yet-merged patch to datalad-tabby, so we call
load_tabby without additional arguments unless the encoding is
given. If the patch gets merged, the if-else can be removed.
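
A minimal sketch of the if-else described above, assuming the loader is passed in as a callable. `call_loader`, `old_loader`, and `new_loader` are hypothetical names; the two loaders are stand-ins for load_tabby before and after the patch, not the real function.

```python
def call_loader(load_func, src, encoding=None):
    """Call load_func(src), adding encoding= only when one was given.

    Hypothetical wrapper: it keeps working with a loader that does
    not yet accept an `encoding` keyword.
    """
    if encoding is None:
        return load_func(src)
    else:
        return load_func(src, encoding=encoding)


def old_loader(src):
    """Stand-in for a loader without the encoding parameter."""
    return (src, None)


def new_loader(src, encoding=None):
    """Stand-in for a loader with the encoding parameter."""
    return (src, encoding)
```

With this shape, `call_loader(old_loader, "x.tsv")` never passes an unknown keyword, while `call_loader(new_loader, "x.tsv", encoding="iso-8859-1")` forwards the encoding.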

Related datalad-tabby change:
psychoinformatics-de/datalad-tabby#116
Successfully merging this pull request may close these issues.

Load-tabby is locale-dependent w.r.t file encoding